library(kableExtra)
library(tidyverse)
library(dplyr)
library(ggpubr)
library(lattice)
library(lme4)

Motivation

[LRJ: Once we finalize the case study, we should come back and revisit the intro/motivation section]

There are over 1,400 public schools in Maryland, and there are outstanding students in each of them. The federal Every Student Succeeds Act (ESSA) prompted states to develop long-term plans to improve schools through accountability and innovation, setting Maryland’s schools on the path to continuous improvement. The Maryland Report Card website aims to share the most current information available to help stakeholders understand and measure student achievement in all 24 local school systems. Here is a message from the State Superintendent of Schools that explains how it works.

[source: https://www.youtube.com/watch?time_continue=8&v=g5ylQgdisTM]

Our analysis is inspired by this plan - figuring out how well schools were performing and how different factors influenced their performance. Once we identify schools that need improvement and the influential factors, both Maryland’s government and local organizations can take action and provide necessary support, in a way that is understandable and reliable.

The libraries used in this study are listed in the following table, along with their purpose in this particular case study:

Library Purpose
kableExtra Helps with building common complex tables and manipulating table styles; creates nice-looking HTML tables
tidyverse A coherent system of packages for data manipulation, exploration and visualization
dplyr Helps you solve the most common data manipulation challenges

In order to run this code please ensure you have these packages installed.
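If any of them are missing, a quick way to install only the missing ones (the package list below simply mirrors the library() calls at the top of this document) is:

```r
# Install any of the case study's packages that are not yet available.
pkgs <- c("kableExtra", "tidyverse", "dplyr", "ggpubr", "lattice", "lme4")
missing <- pkgs[!sapply(pkgs, requireNamespace, quietly = TRUE)]
if (length(missing) > 0) install.packages(missing)
```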

The learning objectives for this case study include:

What is the data?

The data files were downloaded from the Maryland Report Card website. Data files can be found by clicking Data Download at the bottom of the web page.

At this link, you can find data for all 1400+ schools in Maryland from 2003 to 2018. For this case study, we will focus on data on high schools from 2017. For this analysis, we will extract data from 5 of the files for that year:

Data Import

After downloading the 5 data files, we can write a script to import all the files in the same folder into R relatively easily. In this case, we’ve saved all files in a folder named Data:

Knowing which working directory you are in is necessary, since it enables R to find these files quickly and accurately. Since our data files are in the folder Data, we will need to include this folder in the path to each file.
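As a quick sanity check (the setwd() path below is just a placeholder - substitute your own project location), you can confirm where R is looking and build a path into the Data folder:

```r
# Where is R currently looking for files?
getwd()

# If the Data folder is not under the current working directory,
# point R at the right place (placeholder path -- use your own):
# setwd("path/to/your/project")

# file.path() builds paths that work on any operating system:
file.path("Data", "PARCC_2017.csv")
## [1] "Data/PARCC_2017.csv"
```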

Instead of importing each file with an individual line of code, we can use the function lapply() to import them all at once. To do this, first we make a vector of the names of all the files we want to import. The list.files() function will make a vector of all files in a folder that match a given pattern. Here we want all .csv files, so we can use pattern="*.csv" to match all files that end with .csv. (Strictly speaking, the pattern argument is a regular expression, so a more precise pattern would be "\\.csv$", but "*.csv" works for this folder.)

files = list.files(path='./Data',pattern="*.csv")
files
## [1] "Cohort_Grad_Rate_2016.csv"         "PARCC_2017.csv"                   
## [3] "Special_Services_2017.csv"         "Student_Mobility_2017.csv"        
## [5] "Wealth_Expenditures_Data_2017.csv"

Next we create a vector of paths to these files by pasting the folder name, Data, to the path for each file.

filePaths = paste0("./Data/", files)
filePaths
## [1] "./Data/Cohort_Grad_Rate_2016.csv"        
## [2] "./Data/PARCC_2017.csv"                   
## [3] "./Data/Special_Services_2017.csv"        
## [4] "./Data/Student_Mobility_2017.csv"        
## [5] "./Data/Wealth_Expenditures_Data_2017.csv"

Finally, we can read in each data file. The lapply() function will apply the same function repeatedly to each item in a vector and store the results in a list. The function we want to apply repeatedly is the read_csv() function; by giving the vector of file paths and this function to lapply(), we can read in all 5 data files at once.

data <- lapply(filePaths, read_csv)

Now we have a data object that consists of a list of 5 data sets.

data

We can access an individual data set with the [[ ]] operator.

data[[1]]

Looking at the data, we can see that many of the column names have spaces, which means that we will have to put backticks around the names to refer to them, for example: `Class of Year`. We can replace all the spaces in the names with . using the make.names() function:

names(data[[1]])
##  [1] "Class of Year"         "LEA"                  
##  [3] "LEA Name"              "School Number"        
##  [5] "School Name"           "Cohort"               
##  [7] "Grad Rate"             "Diplomas Earned"      
##  [9] "Adjusted Cohort Count" "Create Date"
make.names(names(data[[1]]))
##  [1] "Class.of.Year"         "LEA"                  
##  [3] "LEA.Name"              "School.Number"        
##  [5] "School.Name"           "Cohort"               
##  [7] "Grad.Rate"             "Diplomas.Earned"      
##  [9] "Adjusted.Cohort.Count" "Create.Date"

We can do this for each of our 5 data sets using a for loop; at the same time, we convert character columns to factors so that summary() will tabulate them in later steps:

for (i in 1:5) {
  names(data[[i]]) <- make.names(names(data[[i]]))
  data[[i]] <- mutate_if(data[[i]], is.character, as.factor)
}

Data Wrangling

Tidy datasets are all alike, but every messy dataset is messy in its own way. - Hadley Wickham, R for Data Science

A large part of the case study involves data wrangling to get these five data files into a single data set that we can analyze. To do this, we will alternate between viewing the data and performing data wrangling steps. In this case study, we will provide screenshots of the data; you can view the data in the same way by using the View() function on the data frame we are working with.

Graduation Rate

The first data file (Cohort_Grad_Rate_2016.csv) records the percentage of students in each school who received a Maryland high school diploma. Let’s look at the data frame that comes from this file:

View(data[[1]])
dim(data[[1]])
## [1] 878  10

This data frame has a total of 878 rows with 10 variables. Looking at this data frame in R, we see there are several issues we will need to address, such as possible missing values designated by *. We will deal with these issues in later steps.

Selecting and renaming variables

Our first step is to subset to only school-level data. We notice that if School.Number equals A, then this row corresponds to county-level data: this row gives the graduation rate across all of the high schools in the county.

We also notice that there can be more than one row of data per school, giving different graduation rates for the school. These graduation rates are for different cohorts of students:

  • 5-year adjusted cohort graduation rate: the percentage of a school’s cohort of first-time 9th grade students who graduate within five years, adjusted for students who transfer in and out of the cohort after 9th grade.
  • 4-year adjusted cohort graduation rate: the percentage of a school’s cohort of first-time 9th grade students who graduate within four years, adjusted for students who transfer in and out of the cohort after 9th grade.
  • 3-year adjusted cohort graduation rate: the percentage of a school’s cohort of first-time 9th grade students who graduate within three years, adjusted for students who transfer in and out of the cohort after 9th grade.

We notice big differences in the values for these graduation rates – the 3-year rate is generally quite low and the 5-year rate is generally higher, while the 4-year rate is generally in the 80-100% range we might expect for a graduation rate. Since high school students typically graduate after four years, it makes more sense to use the 4-year adjusted cohort as our measure of a school’s graduation rate.

We will use the functions filter() and select() from the dplyr package to deal with these two issues.
We can use the filter() function to remove the county-level data (School.Number != 'A') and keep only the 4-year cohort graduation rate (Cohort == '4-year adjusted cohort'), filtering out the 5-year and 3-year rates. We can use the select() function to select only the variables we need from this dataset: LEA.Name, School.Number, School.Name, and Grad.Rate. We can rename these variables while we select them, to give them names that will be faster to work with in R.

df_grad <- data[[1]] %>%
 filter(School.Number != 'A', 
        Cohort == '4-year adjusted cohort') %>%
 select(county=LEA.Name, sch.num=School.Number,
         sch=School.Name, grad.rate=Grad.Rate)

Look at our improved dataset! Only school-level information and useful variables are kept, leaving us a tidier dataset. Now we need to address the remaining missing values in the grad.rate variable.

Missing Values

There are a lot of missing values in the dataset, which would influence further analysis if we left them there. Viewing the data, we see that missing values exist in different forms:

  • '>= 95.00', '<= 5.00': the graduation rate is higher than 95% or lower than 5%, but the exact value is not given.
  • '*': the school’s graduation rate is simply not present in the data.

We use the summary() function to produce descriptive statistics for these values:

summary(df_grad[, c('sch', 'grad.rate')])
##                                          sch         grad.rate  
##  Chesapeake High                           :  2   >= 95.00: 50  
##  Frederick Douglass High                   :  2   *       : 29  
##  Northwestern High                         :  2   <= 5.00 :  8  
##  Aberdeen High                             :  1   66.67   :  2  
##  Academy for College and Career Exploration:  1   88.57   :  2  
##  Academy of Health Sciences at PGCC        :  1   89.17   :  2  
##  (Other)                                   :258   (Other) :174

Pay attention to the first kind of missing value. It is impossible to recover the exact graduation rate from these values, so we decide to use the gsub() function to replace them. gsub() replaces all matches of a pattern in a string; elements of the vector that are not matched are returned unchanged.

Without knowledge of the exact values, we use 2.5 and 97.5 to replace ‘<= 5.00’ and ‘>= 95.00’ respectively. By the way, a graduation rate lower than 5% sounds abnormal - for example, Extended Day Learning Program and Home Assignments-Secondary. These mainly provide special education and special-needs school programs, which may be designed for adults - explaining the low graduation rates.

df_grad$grad.rate<- gsub('<= 5.00','2.5', df_grad$grad.rate)
df_grad$grad.rate <- gsub('>= 95.00','97.5', df_grad$grad.rate)

The second type of missing value (’*’) is confusing; let’s extract those rows and have a look:

head(df_grad[df_grad$grad.rate == '*', ])
## # A tibble: 6 x 4
##   county           sch.num sch                      grad.rate
##   <fct>            <fct>   <fct>                    <chr>    
## 1 Anne Arundel     1274    Marley Glen School       *        
## 2 Anne Arundel     3414    Ruth Parker Eason School *        
## 3 Anne Arundel     4304    Central Special School   *        
## 4 Baltimore County 0075    Crossroads Center        *        
## 5 Baltimore County 0111    Maiden Choice School     *        
## 6 Baltimore County 0922    Ridge Ruxton             *

If we search for Marley Glen School, we find that it mainly provides programs for students with disabilities. Similarly, Ruth Parker Eason School provides a special education program for students with moderate to severe disabilities. So we surmise that schools with a ’*’ graduation rate mainly provide education for students with moderate to severe disabilities. We can either delete these rows or replace the values with NA. Since we need to merge multiple files in later steps, where such rows may be dropped automatically, the most appropriate method for us is to replace the values with NA and deal with them all together.

df_grad$grad.rate <- na_if(df_grad$grad.rate, '*')
df_grad$grad.rate <- as.numeric(df_grad$grad.rate)

Unique id

Our last step prepares for extracting information from several files and merging it together, which requires a key that uniquely specifies each observation. Possible choices in the present dataset are sch.num or sch, but we need to check whether each of their values appears only once in the dataset. If not, they are not suitable choices because they lack uniqueness.

summary(df_grad[,c('sch.num', 'sch')])
##     sch.num                                            sch     
##  0301   :  4   Chesapeake High                           :  2  
##  0102   :  3   Frederick Douglass High                   :  2  
##  0207   :  3   Northwestern High                         :  2  
##  0405   :  3   Aberdeen High                             :  1  
##  0705   :  3   Academy for College and Career Exploration:  1  
##  0108   :  2   Academy of Health Sciences at PGCC        :  1  
##  (Other):249   (Other)                                   :258

From the summary, for both sch.num and sch, some values have a frequency larger than 1. Therefore we need to define a unique key ourselves.

df_grad <- df_grad %>%
    within( id <- paste(county, sch, sep = '-'))

The combination of county and sch generates a new variable, id. To check its uniqueness, we use the table() function, which builds a contingency table of the counts at each combination of factor levels. The as.data.frame() function converts the array-based representation of a contingency table to a data frame containing two variables: each factor level (Var1) and its frequency (Freq).

tab <- as.data.frame(table(df_grad$id))
tab[tab$Freq > 1,]
## [1] Var1 Freq
## <0 rows> (or 0-length row.names)
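The same check can be done more directly with duplicated(); here is a tiny sketch on a made-up sample of id values:

```r
# Made-up sample of id values for illustration only.
ids <- c("Allegany-Fort Hill High", "Allegany-Allegany High",
         "Calvert-Northern High")

# TRUE would signal a repeated key; FALSE means every id is unique.
any(duplicated(ids))
## [1] FALSE
```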

We now have a unique key, id, and have finally created the ideal dataset - a tidy dataset.

PARCC

The Partnership for Assessment of Readiness for College and Careers (PARCC) reflects schools’ academic achievements by assessing the performance of students on state standardized tests, such as English and Math. It not only provides information about students’ mastery of state standards, but also offers teachers and parents timely information to guide instruction and support. From this file, we want to extract the percentage of students scoring at a high performance level, which will serve as the target variable in the data analysis part.

colnames(data[[2]])
##  [1] "Academic.Year"                                   
##  [2] "LEA.Number"                                      
##  [3] "LEA.Name"                                        
##  [4] "School.Number"                                   
##  [5] "School.Name"                                     
##  [6] "Assessment"                                      
##  [7] "Tested.Count"                                    
##  [8] "Level.1..Did.not.yet.meet.expectations...Count"  
##  [9] "Level.1..Did.not.yet.meet.expectations...Percent"
## [10] "Level.2..Partially.met.expectations...Count"     
## [11] "Level.2..Partially.met.expectations...Percent"   
## [12] "Level.3..Approached.expectations...Count"        
## [13] "Level.3..Approached.expectations...Percent"      
## [14] "Level.4..Met.expectations...Count"               
## [15] "Level.4..Met.expectations...Percent"             
## [16] "Level.5..Exceeded.expectations...Count"          
## [17] "Level.5..Exceeded.expectations...Percent"        
## [18] "Create.Date"

PARCC is a large file with 8989 observations and 18 variables. We notice there are five levels of performance indicators, ranging from Level 1 to Level 5. First, we use the percentage information rather than the count information. Second, we will create a new variable, pro, that gives the percentage of students performing at the ‘met expectations’ and ‘exceeded expectations’ levels.

Feature Selection and Rename

Similar to the graduation rate file, only school-level information and the necessary variables (LEA.Name, School.Number, School.Name, Assessment and the five level indicators) will be kept and renamed.

df_parcc <- data[[2]] %>%
  filter( School.Number != 'A' ) %>%
  select(3,4,5,6,9,11,13,15,17)

colnames(df_parcc) <- c('county','sch.num', 
                        'sch', 'subject','L1',
                        'L2','L3','L4','L5')

The variable subject records which subject test the row refers to. We keep English/Language Arts Grade 10, Algebra 1 and Algebra 2 using the filter() function from the dplyr package. You are encouraged to keep any other assessments that interest you.

df_parcc <- df_parcc %>%
  filter(subject %in% c('English/Language Arts Grade 10','Algebra 1','Algebra 2'))

Missing Values

Missing values are the toughest problem in this part of the wrangling. Let’s see the summary of these five levels first.

summary(df_parcc[, 5:9])
##        L1            L2            L3            L4            L5     
##  <= 5.0 :273   <= 5.0 :176   <= 5.0 :127   <= 5.0 :151   <= 5.0 :597  
##  17.6   :  7   28.6   :  9   33.3   : 10   33.3   :  7   6.8    :  7  
##  5.5    :  7   15.2   :  8   22.7   :  8   75     :  7   19.8   :  5  
##  15.8   :  6   16.7   :  7   25     :  8   11.4   :  5   5.6    :  5  
##  5.6    :  6   17.9   :  7   14.3   :  7   17.6   :  5   5.1    :  4  
##  6.7    :  6   18.2   :  7   19.7   :  7   34.1   :  5   8.6    :  4  
##  (Other):569   (Other):660   (Other):707   (Other):694   (Other):252

There is only one kind of missing value, ‘<= 5.0’, in these five level indicators. We convert the columns to numeric data, which turns these values into NA automatically (with a coercion warning).

df_parcc[,5:9] <- 
  lapply(df_parcc[,5:9], function(x) as.numeric(as.character(x)))

We only care about the proportion of L4 and L5, but NAs are distributed differently across rows. To simplify the wrangling steps, we collapse the five levels into two groups: ‘Level weak’ (L1, L2 and L3) and ‘Level excellent’ (L4 and L5). We then deal with NAs in three ways. If both ‘Level excellent’ values are complete, pro equals the sum of L4 and L5; for the Algebra 1 assessment of ‘Washington Middle’, pro is \(85.5 + 9.7 = 95.2\). If all ‘Level weak’ values exist, pro equals 100 percent minus the sum of L1, L2 and L3; for example, the Algebra 1 pro for ‘Fort Hill High’ equals \(100-23.9-44.6-21.7 = 9.8\).

df_parcc$pro <- NA

indx1 <- !is.na(df_parcc$L4) &  !is.na(df_parcc$L5)
df_parcc[indx1, 'pro'] <- rowSums(df_parcc[indx1,8:9])

sum(is.na(df_parcc$pro))
## [1] 597
indx2 <- !is.na(df_parcc$L1) &  !is.na(df_parcc$L2) & !is.na(df_parcc$L3)

df_parcc[indx2, 'pro'] <-  100-rowSums(df_parcc[indx2,5:7])

sum(is.na(df_parcc$pro)) 
## [1] 203

The function is.na() indicates which elements are missing and returns a logical index with the same shape as its input. df_parcc[indx1, ] and df_parcc[indx2, ] select the observations with complete ‘Level excellent’ and complete ‘Level weak’ information respectively. sum(is.na(df_parcc$pro)) counts the NAs remaining in pro; each of the steps above reduces that count substantially.

If missing values exist in both groups, we first subtract the existing values from 100 percent, then divide the difference by the number of missing values and replace each NA with this quotient. For instance, ‘Brooklyn Park Middle’ has three NAs, so each \(NA = (100-20.8-72.7)/3=2.2\), and pro equals \(72.7+2.2=74.9\). The reason we don’t simply replace these NAs with 2.5 is a constraint we want to respect: the five levels must sum to 100 percent. It is more reasonable to assume the missing percentage is spread uniformly across the missing levels than to replace them with a fixed value.

indx3 <- is.na(df_parcc$pro)


na_sum <- 100-rowSums(df_parcc[indx3, 5:9], na.rm = TRUE)
n <- rowSums(is.na(df_parcc[indx3, 5:9]))
na <- round(na_sum/n,1)

for (i in seq_len(sum(indx3))){  # loop over the rows with missing pro only
  df_parcc[indx3, 5:9][i, which(is.na(df_parcc[indx3, 5:9][i, ]))] <- na[i]
}

df_parcc[indx3, 'pro'] <- rowSums(df_parcc[indx3,8:9])
sum(is.na(df_parcc$pro)) 
## [1] 0

What does the above script do? First, the rows containing NA in pro are flagged by the logical vector indx3 (203 observations). Then, for each flagged row, we calculate na_sum, the difference between 100 percent and the sum of the existing values, and n, the number of missing values. Dividing na_sum by n gives the quotients na. All three of these variables are vectors of length 203, one entry per flagged row. The last step is the replacement, done with a for loop. For each flagged row (df_parcc[indx3, 5:9][i, ]), the columns containing NA are identified with the is.na() function, and applying which() gives the corresponding column indexes. Once we know the positions of the NAs, we assign the precomputed values na[i] to them.
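To see the even-split imputation in isolation, here is a minimal sketch reproducing the ‘Brooklyn Park Middle’ arithmetic from above on a single made-up row (which columns hold the observed 20.8 and 72.7 is an assumption for illustration; only the arithmetic matters):

```r
# One row of level percentages with three missing values.
# Assumed placement: 20.8 in L1 and 72.7 in L4 (illustrative only).
row <- c(L1 = 20.8, L2 = NA, L3 = NA, L4 = 72.7, L5 = NA)

na_sum <- 100 - sum(row, na.rm = TRUE)   # percentage left over: 6.5
n      <- sum(is.na(row))                # number of missing levels: 3
row[is.na(row)] <- round(na_sum / n, 1)  # each missing level gets 2.2

sum(row[c("L4", "L5")])                  # pro = 72.7 + 2.2
## [1] 74.9
```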

Now we have 0 NAs, which confirms that our methods worked. The following step is to create a unique id - again the combination of county name and school name - and keep the new variable pro (along with the identifying variables) in the final dataset.

pac <- df_parcc %>%
  within( id <- paste(county, sch, sep = '-')) %>%
  select(county, sch.num, sch, subject, pro, id)

This is what the final PARCC dataset looks like:

What’s more, the last two rows - The Seed School of Maryland - lack information about which county the school belongs to.

tail(pac)
## # A tibble: 6 x 6
##   county    sch.num sch            subject           pro id                
##   <fct>     <fct>   <fct>          <fct>           <dbl> <chr>             
## 1 Baltimor… 454     Carver Vocati… Algebra 2        1    Baltimore City-Ca…
## 2 Baltimor… 480     Baltimore Cit… English/Langua… 45.1  Baltimore City-Ba…
## 3 Baltimor… 480     Baltimore Cit… Algebra 1       27.7  Baltimore City-Ba…
## 4 Baltimor… 480     Baltimore Cit… Algebra 2       16.7  Baltimore City-Ba…
## 5 SEED      1000    The Seed Scho… English/Langua… 38.4  SEED-The Seed Sch…
## 6 SEED      1000    The Seed Scho… Algebra 1        9.80 SEED-The Seed Sch…

After checking online, we find it is not a traditional high school but a boarding school that draws students from all over the state of Maryland. So it wouldn’t be right to assign it to any particular county, and we choose to exclude it from our analysis.

pac <- pac %>%
  filter( county != 'SEED')

Extract Assessment Information

The proportion of high performance in these assessments (\(p\)) plays the role of the response variable in our later analysis. Thus, splitting them out first makes further analysis convenient.

\[ p_{ela} = \beta_0+\beta_{1} x_1+ \beta_{2} x_2 \]

df_ela <- pac[grep("English/Language Arts Grade 10", pac$subject), ]
df_alg1 <- pac[grep("Algebra 1", pac$subject), ]
df_alg2 <- pac[grep("Algebra 2", pac$subject), ]
dim(df_ela)  #size of df_ela
## [1] 233   6
dim(df_alg1) #size of df_alg1
## [1] 472   6
dim(df_alg2) #size of df_alg2
## [1] 167   6

That’s weird! The size of df_alg1 is larger than that of the other two, so let’s look at it:

head(df_alg1)
## # A tibble: 6 x 6
##   county  sch.num sch                 subject    pro id                    
##   <fct>   <fct>   <fct>               <fct>    <dbl> <chr>                 
## 1 Allega… 405     Fort Hill High      Algebra…  9.80 Allegany-Fort Hill Hi…
## 2 Allega… 406     Washington Middle   Algebra… 95.2  Allegany-Washington M…
## 3 Allega… 504     Braddock Middle     Algebra… 84.4  Allegany-Braddock Mid…
## 4 Allega… 601     Center for Career … Algebra…  7.6  Allegany-Center for C…
## 5 Allega… 606     Allegany High       Algebra…  8.4  Allegany-Allegany High
## 6 Allega… 802     Westmar Middle      Algebra… 87.9  Allegany-Westmar Midd…

It contains middle school information! For Algebra 1, students take the exam the year they learn the material, so both middle schools and high schools have this test. Moreover, the pro of middle schools is much higher than that of high schools, because students who take the Algebra 1 test in middle school are more likely to be strong in the subject. This makes them incomparable with students who take the Algebra 1 test in high school: the high school pro reflects only part of the academic achievement - the students who may not be as strong in Algebra 1. Keep this point in mind in the later analysis.

How do we extract the high school information? Here the variable id comes to work. df_grad only includes information for high schools, so if an observation in df_alg1 also exists in df_grad, it corresponds to a high school; otherwise it is a middle school. The unique variable id enables us to check this overlap with the help of the filter() function.

df_alg1 <- df_alg1 %>%
  filter(id %in% df_grad$id)

dim(df_alg1)
## [1] 222   6

Special Services

Special_Services_2017.csv records the number and percentage of students who receive different special services, and students approved through direct certification. These variables tell us more about school quality and student support. Here we choose the service FARMS: students who receive free or reduced-price meals.

What the following script does is similar to the previous wrangling steps, with one addition. Since the special services file has a variable called School.Type, we can filter for high school information easily by adding the condition School.Type == 'High'.

df_spc <- data[[3]] %>%
  filter( School.Number != 'A', School.Type == 'High' ) %>%
  within( id <- paste(LEA.Name, School.Name, sep = '-')) %>%
  select(LEA.Name, School.Number, School.Name, 
           FARMS.Pct, id)

colnames(df_spc) <- c('county', 'sch.num', 'sch', 'farms', 'id')

summary(df_spc)
##               county      sch.num   
##  Baltimore City  :41   0301   :  4  
##  Prince George's :37   0102   :  3  
##  Baltimore County:33   0207   :  3  
##  Montgomery      :31   0405   :  3  
##  Anne Arundel    :18   0705   :  3  
##  Howard          :14   0108   :  2  
##  (Other)         :94   (Other):250  
##                                          sch          farms    
##  Chesapeake High                           :  2   *      : 12  
##  Frederick Douglass High                   :  2   <= 5.0 :  7  
##  Northwestern High                         :  2   35.1   :  3  
##  Aberdeen High                             :  1   55.4   :  3  
##  Academy for College and Career Exploration:  1   12.5   :  2  
##  Academy of Health Sciences at PGCC        :  1   13.8   :  2  
##  (Other)                                   :259   (Other):239  
##       id           
##  Length:268        
##  Class :character  
##  Mode  :character  
##                    
##                    
##                    
## 

There is only one problem left in this dataset, and it is a familiar one - missing values.

Missing Values

Similarly, let’s have a look at what the schools with ’*’ in farms look like:

df_spc[df_spc$farms == '*',]
## # A tibble: 12 x 5
##    county      sch.num sch                 farms id                        
##    <fct>       <fct>   <fct>               <fct> <chr>                     
##  1 Anne Arund… 1274    Marley Glen School  *     Anne Arundel-Marley Glen …
##  2 Anne Arund… 2233    Anne Arundel Eveni… *     Anne Arundel-Anne Arundel…
##  3 Baltimore … 0077    BCDC Educational C… *     Baltimore County-BCDC Edu…
##  4 Calvert     0206    Calvert Country Sc… *     Calvert-Calvert Country S…
##  5 Carroll     0712    Carroll Springs Sc… *     Carroll-Carroll Springs S…
##  6 Carroll     0717    Post Secondary Pro… *     Carroll-Post Secondary Pr…
##  7 Howard      0522    Cedar Lane Special… *     Howard-Cedar Lane Special…
##  8 Montgomery  0799    Stephen Knolls Sch… *     Montgomery-Stephen Knolls…
##  9 Montgomery  0951    Longview School     *     Montgomery-Longview School
## 10 Prince Geo… 2217    Incarcerated Youth… *     Prince George's-Incarcera…
## 11 Wicomico    0520    Wicomico County Ev… *     Wicomico-Wicomico County …
## 12 Baltimore … 0884    Eager Street Acade… *     Baltimore City-Eager Stre…

It seems that they are also not ‘normal’ public schools, so we change ’*’ to NA.

After replacing ‘<= 5.0’ with ‘2.5’ using the gsub() function, applying the as.numeric() function transforms ’*’ to NA automatically.

df_spc$farms <- gsub('<= 5.0','2.5', df_spc$farms)
df_spc$farms <- as.numeric(df_spc$farms)

This is what the final special services dataset looks like:

Student Mobility

Student_Mobility_2017.csv includes information about the movement of students from one school to another during the school year. Three types of mobility are provided: total student mobility, entry mobility, and exit mobility. To some extent, exit mobility reflects students’ satisfaction with their schools, so we use it as the variable withdraw in our analysis.

As usual, the wrangling steps are: selecting high school observations, creating the unique id, selecting and renaming features, and transforming missing values.

df_mob <- data[[4]] %>%
  filter( School.Number != 'A', School.Type == 'High' )%>%
  within( id <- paste(LEA.Name, School.Name, sep = '-')) %>%
  select(LEA.Name, School.Number, 
           School.Name, Withdrawals.Pct, id)

colnames(df_mob) <- c('county', 'sch.num', 'sch', 'withdraw', 'id')

summary(df_mob)
##               county      sch.num   
##  Baltimore City  :41   301    :  4  
##  Prince George's :37   102    :  3  
##  Baltimore County:33   207    :  3  
##  Montgomery      :31   405    :  3  
##  Anne Arundel    :18   705    :  3  
##  Howard          :14   108    :  2  
##  (Other)         :94   (Other):250  
##                                          sch         withdraw  
##  Chesapeake High                           :  2   <= 5.0 : 77  
##  Frederick Douglass High                   :  2   >= 95.0:  9  
##  Northwestern High                         :  2   5.5    :  8  
##  Aberdeen High                             :  1   8.6    :  5  
##  Academy for College and Career Exploration:  1   10.3   :  3  
##  Academy of Health Sciences at PGCC        :  1   10.9   :  3  
##  (Other)                                   :259   (Other):163  
##       id           
##  Length:268        
##  Class :character  
##  Mode  :character  
##                    
##                    
##                    
## 
df_mob$withdraw <- gsub('<= 5.0','2.5', df_mob$withdraw)
df_mob$withdraw <- gsub('>= 95.0','97.5', df_mob$withdraw)
df_mob$withdraw <- as.numeric(df_mob$withdraw)

This is what the final mobility dataset looks like:

Wealth Expenditures

Wealth_Expenditures_Data_2017.csv contains information about different kinds of investments as well as wealth per pupil. What makes it special is that its size (11 variables and 25 observations) is obviously smaller than the previous files. Let’s have a look at it:

It turns out that this file only contains county-level data! The last row is state-level ‘All Public School’ data, which will need to be deleted when we merge the datasets. We keep the variable county, which will serve as the key in later merge steps. Besides, a new variable exp will be created by the formula:

\[exp = \frac{Wealth.Per.Pupil}{Expenditures.Per.Pupil}\]

The quotient of Wealth.Per.Pupil and Expenditures.Per.Pupil captures each county’s financial condition. The mutate() function in the dplyr package is a nice tool to add new variables. Thus, the wrangling steps are: creating the new variable, then selecting and renaming features.

df_exp <- data[[5]] %>%
  mutate(exp = round(Wealth.Per.Pupil/Expenditures.Per.Pupil, 1)) %>%
  select(LEA.Name, exp)

colnames(df_exp) <- c('county','exp')

We notice the last row contains the ‘All Public School’ information, but it will be dropped automatically when we merge.

Final Dataset

Relational Datasets

Now we have 8 datasets in total, and we need to combine them to answer the questions we are interested in. They are called relational data because it is the relations between them, not just the individual datasets, that are important. Relations are defined between each pair of datasets. In our case, a relation takes the form of the same variable existing in multiple datasets. The following plot describes how our datasets connect:

We can see that the variable county exists in all datasets, and the variables sch, sch.name and id exist in all datasets except df_exp. In addition, each dataset contains its own unique variables.

To work with relational data, mutating joins are a powerful tool: they combine variables from two datasets by first matching observations by their keys, then copying variables across from one dataset to the other. A key is a variable that uniquely identifies an observation, so it can connect each pair of datasets. This is the purpose of the variable id - it uniquely identifies each observation in our datasets.

Mutating joins include inner joins and outer joins. An inner join keeps observations that appear in both datasets, while an outer join keeps observations that appear in at least one of the datasets. There are three types of outer joins: left join, right join and full join. A graphical explanation is given below:

Inner join:

Outer join:

To be more specific, a table is provided to show detailed descriptions.

Types How it works Implementation in dplyr
inner join keeps observations that appear in both datasets inner_join()
left join keeps all observations in x but drops observations in y that are not in x left_join()
right join keeps all observations in y but drops observations in x that are not in y right_join()
full join keeps all observations in x and y full_join()

A left join is the most common and popular method, especially when we only need some of the variables from another dataset. It preserves the original observations even when there isn’t a match - the missing values simply become NA. This is also our best choice, and we will show how it works in our case.
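To make this NA-filling behavior concrete, here is a minimal sketch using two made-up toy tables (not part of the case study data):

```r
library(dplyr)

# Toy tables: school 'C' exists in x but has no match in y
x <- tibble(id = c('A', 'B', 'C'), pro = c(40, 52, 37))
y <- tibble(id = c('A', 'B'), grad.rate = c(85, 90))

left_join(x, y, by = 'id')
# Row 'C' is kept, with grad.rate filled in as NA
```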

Example

The final dataset should include the predictor variables grad.rate, withdraw, farms and exp, and the target variable pro. Three datasets with this structure will be created for df_ela, df_alg1 and df_alg2 respectively. As mentioned, we will focus on the left join with the help of the function left_join(). Let’s see an example for df_ela first:

temp <- df_ela %>%
  select(id, county, pro) %>%
  left_join(df_grad[, c('grad.rate', 'id')], by = 'id')

head(temp)
## # A tibble: 6 x 4
##   id                                  county         pro grad.rate
##   <chr>                               <fct>        <dbl>     <dbl>
## 1 Allegany-Fort Hill High             Allegany      40.9      85.0
## 2 Allegany-Allegany High              Allegany      52.9      90.1
## 3 Allegany-Mountain Ridge High School Allegany      54.2      88.6
## 4 Anne Arundel-Glen Burnie High       Anne Arundel  37.9      88.8
## 5 Anne Arundel-North County High      Anne Arundel  49.5      87.0
## 6 Anne Arundel-Severna Park High      Anne Arundel  83.1      94.9
tail(temp)
## # A tibble: 6 x 4
##   id                                            county        pro grad.rate
##   <chr>                                         <fct>       <dbl>     <dbl>
## 1 Baltimore City-Augusta Fells Savage Institut… Baltimore …  2.4       56.4
## 2 Baltimore City-Coppin Academy                 Baltimore …  9.30      93.3
## 3 Baltimore City-Renaissance Academy            Baltimore …  3.2       66.7
## 4 Baltimore City-Frederick Douglass High        Baltimore …  3.2       59.4
## 5 Baltimore City-Carver Vocational-Technical H… Baltimore …  2.80      84.0
## 6 Baltimore City-Baltimore City College         Baltimore … 45.1       94.9

Look! grad.rate in df_grad was added to df_ela successfully. The by = argument lets you specify the keys you would like to use. Then, we repeat this merge step for df_mob, df_spc and df_exp.

Dataset Variable Kept Keys
df_grad grad.rate: 4-year graduation rate id
df_mob withdraw: students’ exit mobility id
df_spc farms: percentage of students receiving free/reduced price meals id
df_exp exp: quotient of wealth and expenditures county
temp1 <- df_ela %>%
  select(id, county, pro) %>%
  left_join(df_grad[, c('grad.rate', 'id')], by = 'id') %>%
  left_join( df_mob[, c('withdraw', 'id')], by = 'id') %>%
  left_join( df_spc[, c('farms', 'id')], by = 'id') %>%
  left_join(df_exp, by = 'county')

Instead of repeating this by hand, base R provides the function Reduce(), which can come in handy. Reduce() takes a function \(f\) of two arguments and a list or vector \(x\) to be reduced using \(f\). The function is first called on the first two components of \(x\); then with the result of that as the first argument and the third component of \(x\) as the second argument; then again with the result of the second step as the first argument and the fourth component of \(x\) as the second argument; and so on until all elements of \(x\) have been processed.
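As a quick illustration (with toy inputs, not our data), Reduce() folds a two-argument function across a vector or list from left to right:

```r
# ((1 + 2) + 3) + 4
Reduce(`+`, c(1, 2, 3, 4))    # 10

# The same left-to-right folding, with string concatenation:
Reduce(function(x, y) paste(x, y, sep = '-'), c('a', 'b', 'c'))
# "a-b-c"
```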

L_ela <- list(df_ela[, c('id', 'county', 'pro')], 
              df_grad[, c('id','grad.rate')], 
              df_mob[, c('id','withdraw')], 
              df_spc[, c('id','farms')], df_exp)

temp2 <- Reduce(function(x,y) left_join(x, y, 
                                        by = colnames(y)[1]), L_ela)
# `by = colnames(y)[1])` tells how to find the *keys* in each left_join. 

The above script shows how the Reduce() function works in our case. First, the five files are saved in the list L_ela. Then left_join() is applied element by element across L_ela via Reduce(). temp2 is the same dataset as temp1. So, let’s write it as a function and apply it to df_alg1 and df_alg2.

Apply Function

func_merge <- function(data){
  L <- list(data[, c('id', 'county', 'pro')], 
            df_grad[, c('id','grad.rate')], 
            df_mob[, c('id','withdraw')], 
            df_spc[, c('id','farms')], df_exp)
  
  final_data <- Reduce(function(x,y) left_join(x, y, 
                                               by = colnames(y)[1]), L)
  return(final_data)
}

df_ELA <- func_merge(df_ela)
df_ALG1 <- func_merge(df_alg1)
df_ALG2 <- func_merge(df_alg2)

Wonderful! We finally have the datasets we need: df_ELA, df_ALG1 and df_ALG2.
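As an optional sanity check on the merge (a sketch assuming the df_ela and df_grad objects built earlier), anti_join() from dplyr returns the rows of the first dataset whose key has no match in the second - exactly the rows that would carry NA after a left join:

```r
library(dplyr)

# Schools in df_ela that have no graduation record in df_grad
unmatched <- anti_join(df_ela, df_grad, by = 'id')
nrow(unmatched)   # 0 would mean every school matched; otherwise inspect these rows
```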

Exploratory data analysis

Before we move on to the data analysis, let’s first use data visualization to examine the distribution of our variables and their relationships. This step can be crucial for the models we apply later.

Response Variable

The variable we care about most is the response variable pro. As we mentioned, this is a multilevel dataset, so it is instructive to check whether there is variation in pro between and within counties.

# Boxplot
p1 <- ggplot(df_ALG1, aes(x = reorder(county, pro), 
  y = pro,fill = county)) + geom_boxplot()+
  guides(fill = FALSE) + theme_bw() + 
    theme(axis.title.x = element_blank(),
          axis.text.x = element_text(angle = 90, hjust = 1,vjust = 0.5)) + 
  labs( y = 'Distribution', 
        caption = 'proportion of high performance for Algebra 1 in year 2017')

ALG1_sum <- df_ALG1 %>% 
  group_by(county) %>% 
  summarise(mean = mean(pro),
            sd = sd(pro))
# Line plot
p2 <- ggplot(ALG1_sum, aes( reorder(factor(county), mean), mean, group = 1)) +
  geom_line(alpha = 1/3) + theme_bw()+
  geom_pointrange(aes(ymax = mean + sd, ymin = mean - sd)) +
  labs(caption = 'Error bars') + 
    theme(axis.title.x = element_blank(),
          axis.text.x = element_text(angle = 90, hjust = 1,vjust = 0.5))

ggarrange(p1, p2, ncol = 2)

Arguments used and their functions are given below:

  • reorder(county, pro): Display the boxes for each county in ascending order of pro.
  • guides(fill = FALSE): Remove legend for a particular aesthetic.
  • theme_bw(): Display the theme with a white background.
  • theme(): Modify a single plot’s theme by specifying different theme elements. Elements we use include
    • axis.title.x = element_blank(): Remove the x-axis title.
    • axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5): Rotate the x-axis text 90 degrees and set its horizontal and vertical justification.
  • labs(y = , caption = ): Change axis labels and legend titles.
  • summarise(): Used on grouped data created by group_by(). The output will have one row for each group. The values we compute are the mean and standard deviation of pro for each county.
  • geom_line(alpha = 1/3): Connect observations, ordered by x value. alpha = can modify colour transparency.
  • geom_pointrange: Draw a vertical interval defined by ymin and ymax.
  • ggarrange(): Arrange the subfigures into one plot.

Both plots indicate that pro varies a lot: there appears to be within-county variation, as indicated by the size of the boxes, and the variation in means across counties illustrates between-county variation ranging from 0 to 50. How about the other two datasets? Does the variation still hold there?

The script below defines a function distplot. A new function can be created to avoid repetitive work, especially when the work is lengthy and cumbersome.

distplot <- function(data, var, var.name){
  p <- ggplot(data, aes(x = reorder(county, var), 
  y = var, fill = county)) + geom_boxplot()+
  guides(fill = FALSE) + theme_bw() + 
    theme(axis.title.x = element_blank(),
          axis.text.x = element_text(angle = 90, hjust = 1,vjust = 0.5))+
  labs( y = paste("Distribution of ", var.name), 
        caption = deparse(substitute(data)))
  # specify different variables and datasets used here
  return(p)
}

p3 <- distplot(df_ALG1, df_ALG1$pro, "pro")
p4 <- distplot(df_ALG2, df_ALG2$pro, "pro")
p5 <- distplot(df_ELA, df_ELA$pro, "pro")


ggarrange(p3, p4, p5, nrow = 2, ncol = 2)

We notice that the second plot is missing information for two counties: Baltimore County and Dorchester. A quick online search explains why: the Algebra 2 test is not mandatory for students in these two counties, so there is no information on their performance.

Within-county variation in df_ELA is the most evident, as indicated by the box sizes for Kent and Baltimore County; between-county variation in df_ALG2 is the largest, ranging from 0 to 90. Remember that students who take the Algebra 1 test in high school may not be strong in it, which is why the general distribution of pro for Algebra 1 is lower than that for Algebra 2.

Predictor Variable

In addition to the response variable, distribution of predictor variables can also be informative. Let’s apply function distplot on them and we still use df_ALG1 as an example.

p6 <- distplot(df_ALG1, df_ALG1$grad.rate, "grad.rate")
p7 <- distplot(df_ALG1, df_ALG1$farms, "farms")
p8 <- distplot(df_ALG1, df_ALG1$withdraw, "withdraw")
p9 <- distplot(df_ALG1, df_ALG1$exp, "exp")

ggarrange(p6,p7, p8, p9, nrow = 2, ncol = 2)

The distributions of exp and farms follow the pattern of pro: variation across counties is obvious. But grad.rate and withdraw are skewed, and their variation is tiny.

Relationship between Response Variable and Predictor Variables

Before the data is ready for the actual analysis, it needs to undergo a few more steps. What we expect is a linear relationship between predictors and response, so let’s plot their relationships.

relaplot1 <- function(var, var.name){
  
  p <- ggplot(df_ALG1, aes(x = var, y = pro))+    
  geom_point()+ geom_smooth(se = FALSE)+
  theme(legend.position = "none") +
    labs(x = var.name)
  return(p)
}

relaplot2 <- function(var, var.name){

  p <- ggplot(data = df_ALG1, aes(x = var , y = pro,group = county))+    
  facet_wrap( ~ county, switch = 'x', nrow = 3)+
  geom_point(aes(colour = county))+ 
  geom_smooth(se = FALSE, aes(colour = county))+
  theme(legend.position = "none") +
  theme(axis.text = element_blank(),
        axis.ticks = element_blank()) +
    labs(x = var.name)
  return(p)
}

Function relaplot1() plots the overall relationship, while relaplot2() captures differences between counties thanks to facet_wrap(). The function facet_wrap() in package ggplot2 wraps a 1d sequence of panels into 2d. Arguments used and their functions are given below:

  • geom_smooth(se = FALSE): Add a smooth regression line and hide confidence interval around smooth line.
  • switch = 'x': Display the top labels to the bottom.
  • nrow=3: Specify the number of rows in panels.
  • theme(): Modify a single plot’s theme by specifying different theme elements. Elements we use include
    • legend.position = "none": Do not display legend.
    • axis.title = element_blank(): Remove labels of x and y axis.
    • axis.text = element_blank(): Remove texts of x and y axis.

Use farms and grad.rate as examples:

p10 <- relaplot1(df_ALG1$farms, "farms")
p11 <- relaplot2(df_ALG1$farms, "farms")
p12 <- relaplot1(df_ALG1$grad.rate, "grad.rate")
p13 <- relaplot2(df_ALG1$grad.rate, "grad.rate")
ggarrange(p10,p11,p12,p13, 
          ncol = 2, nrow = 2, widths = c(3,9))

The overall relationship between farms and pro can be considered linear. For grad.rate, however, it is not surprising to see a non-linear relationship - the skewed distribution plot (p6) foreshadowed this. And withdraw is also positively skewed. Such skewed distributions bias the fit towards ‘strange’ values, making outliers more influential. To mitigate the effects of skewness, a common transformation such as the log transformation can be applied.

df_ALG1 <- df_ALG1 %>%
  mutate(grad.rate.log = log(grad.rate)) %>%
  mutate( withdraw.log = log(withdraw))

p14 <- relaplot1(df_ALG1$grad.rate, " grad.rate")
p15 <- relaplot1(df_ALG1$grad.rate.log, "log grad.rate")
p16 <- relaplot1(df_ALG1$withdraw, " withdraw")
p17 <- relaplot1(df_ALG1$withdraw.log, "log withdraw")

ggarrange(p14,p15,p16,p17,
          ncol = 2, nrow = 2)

We see that the log transformation is effective on withdraw, but the result on grad.rate is even worse! Why did we get this result? It comes down to the properties of the log transformation - it “pulls in” extreme values on the right (high values) relative to the left (low values). So the log does not always perform effectively.
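A tiny numeric example makes this “pulling in” visible: gaps between values grow on the raw scale but become equal after taking logs:

```r
x <- c(1, 10, 100, 1000)
diff(x)        # 9, 90, 900: gaps explode on the right
diff(log(x))   # log(10), log(10), log(10): equal gaps after transformation
```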

To deal with this special case, we first compute 100 minus grad.rate and record it as non.grad.rate, then apply the log transformation. This guarantees a nicer ‘original’ distribution as well as a better result from the log transformation.

df_ALG1 <- df_ALG1 %>%
  mutate(non.grad.rate = 100 - grad.rate) %>%
  mutate( non.grad.rate.log = log(non.grad.rate))

p18 <- relaplot1(df_ALG1$non.grad.rate.log, "log non.grad.rate")
ggarrange(p14,p15,p18,ncol = 3)

Nice! Now we have an ideal variable - non.grad.rate.log - that reduces the influence of the skewed distribution. What we need to pay attention to is how to interpret the results: a one-unit decrease in non.grad.rate corresponds to a one-unit increase in grad.rate.

Data analysis

To evaluate the influence of each predictor variable, linear regression may be the first method that comes to mind. Let’s fit a simple linear regression model here:

m <- lm(pro ~ non.grad.rate.log + withdraw.log + farms + exp, data = df_ALG1)
summary(m)
## 
## Call:
## lm(formula = pro ~ non.grad.rate.log + withdraw.log + farms + 
##     exp, data = df_ALG1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -30.905  -7.831  -1.672   7.186  63.034 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       51.39580    4.43986  11.576  < 2e-16 ***
## non.grad.rate.log -2.16358    1.90002  -1.139   0.2561    
## withdraw.log      -4.25472    2.10261  -2.024   0.0443 *  
## farms             -0.44928    0.08127  -5.528 9.37e-08 ***
## exp               -0.02212    0.09345  -0.237   0.8131    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.01 on 214 degrees of freedom
##   (3 observations deleted due to missingness)
## Multiple R-squared:  0.543,  Adjusted R-squared:  0.5344 
## F-statistic: 63.56 on 4 and 214 DF,  p-value: < 2.2e-16

The result shows that only farms and withdraw.log are significant. It also indicates that exp has a negative relationship with pro: a higher wealth-to-expenditure ratio is associated with lower performance in Algebra 1, which contradicts common sense. The result may not be reliable, so we decide to apply a more suitable model - the multilevel model.

Multilevel Model Introduction

Nevertheless, what makes our dataset different is its nested data structure: units are clustered into subgroups that are themselves nested within larger groups. We use a plot to show our dataset’s structure:

As shown above, the dataset can be divided into two levels: county level (L1) and school level (L2). Such structures are common in the human and biological sciences. For instance, in educational settings, students are clustered into schools and schools are clustered into districts, which are part of a larger system (county, state or country). The subgroup-specific characteristics may have some influence on outcomes. Another typical example is longitudinal data, where a variable’s values over time are correlated with each other. In these cases, linear regression fails to capture the influence of clustering, so we need a more nuanced analysis.

Multilevel models (MLM) are statistical models with parameters that vary at more than one level. To be specific, a two-level model allows for variation at both the school level and the county level, which cannot be accounted for in traditional ordinary least squares regression. County-level variation, also called ‘county effects’, represents a source of dependence that influences schools’ performance. An MLM includes both fixed and random parameters: a fixed parameter is constant over all the groups, whereas a random parameter takes a different value for each group.

This section will go over how to run an MLM on our dataset in R, focusing on estimating fixed and random parameters and on interpreting the output. We will start with a fairly simple model and then fit slightly more complex ones, as the primary goal is to find a proper model and interpretation rather than an exhaustive review of highly complex MLMs. This section also includes a short review of different library options for MLM analyses.

Simple Models

Let’s look at an example and examine a model that just estimates a random intercept. This simple model asks the question: ‘Is there variability in Algebra 1 achievement across counties?’

\[y_{ij} = \alpha_{j} + \beta_j*x_{ij}+\varepsilon_{ij}\]

Variables Definition
\(y_{ij}\) the dependent variable for school \(i\) in county \(j\)
\(\alpha_{j}\) the intercept of the dependent variable in county \(j\)
\(x_{ij}\) the value of the predictor for school \(i\) in county \(j\)
\(\beta_j\) the slope in county \(j\) for the relationship between \(x_{ij}\) and the response variable
\(\varepsilon_{ij}\) the school-level residual error
m1 <- lmer(pro ~ 1 + (1 | county), data = df_ALG1)

The function lmer() in package lme4 is a powerful tool for fitting MLMs. It analyzes data using restricted maximum likelihood estimation (REML) rather than ordinary least squares (OLS), making it ideal for MLM analyses. Its arguments include:

  • formula: lmer(formula = RV ~ Fixed_Factor + (Random_Intercept + Random_Slope | Random_Factor), data = ...), where RV is the response variable; Fixed_Factor is a predictor whose effect is the same across all groups; Random_Intercept and Random_Slope mean that the intercepts and slopes differ between groups, each with its own overall mean and variance; Random_Factor is the grouping factor.
  • data =: the dataset used in analysis

By setting the fixed part to 1, no predictors are included in m1 - there are no fixed slopes; (1 | county) means we only estimate a random intercept. This intercept represents the average Algebra 1 achievement across counties. Then use summary() to display the result:

summary(m1)
## Linear mixed model fit by REML ['lmerMod']
## Formula: pro ~ 1 + (1 | county)
##    Data: df_ALG1
## 
## REML criterion at convergence: 1841.6
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.2154 -0.4789 -0.2913  0.3276  5.6352 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  county   (Intercept) 156.8    12.52   
##  Residual             198.0    14.07   
## Number of obs: 222, groups:  county, 24
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)   24.372      2.901   8.403

The random effects part measures how much of the variability in the dependent variable pro is due to the random effect county. The variance for county is 156.8 - how much variance there is between counties in Algebra 1 achievement. The residual variance is 198.0 - the variability that is not due to county. To partition the variance, we take the variance component of the intercept and divide it by the sum of the residual variance and the intercept variance: 156.8 / (156.8 + 198.0) ≈ 0.4419, meaning county-level factors explain 44.19% of the variance.
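The same partition can be computed from the model object rather than by hand; a short sketch using VarCorr() from lme4:

```r
library(lme4)

# Extract the variance components of m1 as a data frame:
# row 1 is the county intercept variance, row 2 the residual variance
vc <- as.data.frame(VarCorr(m1))
icc <- vc$vcov[1] / sum(vc$vcov)
round(icc, 4)   # about 0.4419 given the output above
```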

The fixed effects part provides the model intercept, which in this intercept-only model is simply the average pro across all counties.

Complex Models

We would like to see how certain variables influence the response variable, so complex models come to our mind. We will first start from applying univariate models and try multivariate models in later steps.

Univariate Models

Let’s start! We would first estimate the effect of non.grad.rate.log on Algebra 1 achievement:

m2 <- lmer(pro ~ non.grad.rate.log + (1|county), data = df_ALG1)
summary(m2)
## Linear mixed model fit by REML ['lmerMod']
## Formula: pro ~ non.grad.rate.log + (1 | county)
##    Data: df_ALG1
## 
## REML criterion at convergence: 1710.6
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -1.8385 -0.6236 -0.1115  0.3834  5.3305 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  county   (Intercept)  93.89    9.69   
##  Residual             123.37   11.11   
## Number of obs: 219, groups:  county, 24
## 
## Fixed effects:
##                   Estimate Std. Error t value
## (Intercept)        46.0608     2.9140   15.81
## non.grad.rate.log -10.4176     0.8878  -11.73
## 
## Correlation of Fixed Effects:
##             (Intr)
## nn.grd.rt.l -0.634

Pay attention to the random effects: compared to m1, which has no fixed effect, the variance for county - now 93.89 - decreased considerably. This is because the variation due to non.grad.rate.log was confounded with the variation due to county. Also, m1 did not know about the variation associated with non.grad.rate.log, so its predictions were further off, leading to larger residuals. Now that the effect of non.grad.rate.log is specified, a considerable amount of the variance that was previously in the random effects component is shifted to the fixed effects component.

In the fixed effects, the estimated coefficient for non.grad.rate.log is -10.42, which is the slope for the effect of non.grad.rate.log: for each one-unit decrease in non.grad.rate.log, pro increases by 10.42 on average. There is also a standard error associated with this slope, and a t value, which is simply the estimate divided by the standard error.

Then we allow non.grad.rate.log to be a random effect and see what changes:

m3 <- lmer(pro ~ non.grad.rate.log + (non.grad.rate.log |county), data = df_ALG1,
           control = lmerControl(optimizer ="Nelder_Mead"))
summary(m3)
## Linear mixed model fit by REML ['lmerMod']
## Formula: pro ~ non.grad.rate.log + (non.grad.rate.log | county)
##    Data: df_ALG1
## Control: lmerControl(optimizer = "Nelder_Mead")
## 
## REML criterion at convergence: 1694.1
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -1.9903 -0.6050 -0.1017  0.4346  5.4966 
## 
## Random effects:
##  Groups   Name              Variance Std.Dev. Corr 
##  county   (Intercept)       278.16   16.678        
##           non.grad.rate.log  13.94    3.733   -1.00
##  Residual                   114.45   10.698        
## Number of obs: 219, groups:  county, 24
## 
## Fixed effects:
##                   Estimate Std. Error t value
## (Intercept)         49.636      4.140   11.99
## non.grad.rate.log  -12.479      1.139  -10.96
## 
## Correlation of Fixed Effects:
##             (Intr)
## nn.grd.rt.l -0.937
## convergence code: 0
## singular fit

In contrast to the model above, (1 | county) was updated to (non.grad.rate.log | county). Thus, we have specified both a random intercept and a random slope for non.grad.rate.log by county. This enables the slopes for non.grad.rate.log predicting Algebra 1 achievement to vary from county to county. Additionally, optimizer = "Nelder_Mead" specifies the optimizer we would like to use. From the EDA we found that the relationship between the predictor variables and the response variable is non-linear; Nelder-Mead is often applied to nonlinear optimization problems, so it is a reasonable choice when we include a random slope.

In the random effects, we now get an estimate of the variance component associated with non.grad.rate.log, which tells us about the variability across counties in the effect of non.grad.rate.log on Algebra 1 achievement. The value is 13.94, and we get a ‘singular fit’ warning at the bottom. Singularity means random-effect variance estimates of (nearly) zero, or estimated correlations of almost -1 or 1. Look at the Corr section: it tells us whether the county random intercepts are related to the county random slopes for non.grad.rate.log. The correlation is -1, confirming the model’s singularity.

In all, even though the residual variance decreases, the model is overfitting, and non.grad.rate.log should not be included in the random part. Actually, if we look back at the EDA section, the plot of the relationship between non.grad.rate.log and pro (p13) foreshadowed this result - there is no obvious difference in the relationship across counties.
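One way to back this judgement with a formal test (a sketch, not run in the original analysis) is a likelihood-ratio comparison of the random-intercept model m2 and the random-slope model m3; anova() refits both with maximum likelihood and compares them:

```r
# Compare nested random-effect structures; a small p-value would favor
# the more complex m3, while a large one supports keeping m2
anova(m2, m3)
```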

Besides non.grad.rate.log, we also examined the relationship between farms and pro in the EDA. Therefore we can apply the above models to the variable farms and look for differences.

m4 <- lmer(pro ~ farms + (1 | county), data = df_ALG1)
summary(m4)
## Linear mixed model fit by REML ['lmerMod']
## Formula: pro ~ farms + (1 | county)
##    Data: df_ALG1
## 
## REML criterion at convergence: 1720.2
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.0770 -0.5402 -0.0974  0.3904  6.6309 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  county   (Intercept)  78.45    8.857  
##  Residual             113.86   10.670  
## Number of obs: 222, groups:  county, 24
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept) 46.90891    2.71094    17.3
## farms       -0.60340    0.04643   -13.0
## 
## Correlation of Fixed Effects:
##       (Intr)
## farms -0.641

Note that both the county variance and the residual variance decrease greatly compared with m2, which means m4 performs better. The estimated coefficient for farms is -0.6, which is the slope for the effect of farms: for each one-unit (1%) increase in farms, pro decreases by 0.6 on average.

Now let’s try making farms a random effect:

m5 <- lmer(pro ~ farms + (farms |county), data = df_ALG1,
           control = lmerControl(optimizer ="Nelder_Mead"))
summary(m5)
## Linear mixed model fit by REML ['lmerMod']
## Formula: pro ~ farms + (farms | county)
##    Data: df_ALG1
## Control: lmerControl(optimizer = "Nelder_Mead")
## 
## REML criterion at convergence: 1713.6
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.1839 -0.5434 -0.1178  0.3648  6.9456 
## 
## Random effects:
##  Groups   Name        Variance  Std.Dev. Corr 
##  county   (Intercept) 185.18702 13.6083       
##           farms         0.02543  0.1595  -0.91
##  Residual             108.64442 10.4233       
## Number of obs: 222, groups:  county, 24
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept) 48.20314    3.66960   13.14
## farms       -0.65252    0.06129  -10.65
## 
## Correlation of Fixed Effects:
##       (Intr)
## farms -0.866

[NOTE: figure out the two correlations]

Similar to m3, the residual variance in m5 decreases compared to m4, but not by much. In all, random slopes may not be a good choice here, mainly because of the unbalanced data - the number of observations per county is not the same, and some counties have just one observation.

So far, we haven’t talked about significance. [1] mentions: “p values for the corresponding F and t tests are not provided by the lme4 package. The reason is connected with the fact that generally the exact null distributions for the parameter estimates and test statistics are unknown.” But this can be remedied by the package lmerTest. Let’s load it:

library(lmerTest)
m6 <- lmer(pro ~ farms + (farms |county), data = df_ALG1,
           control = lmerControl(optimizer ="Nelder_Mead"))
summary(m6)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: pro ~ farms + (farms | county)
##    Data: df_ALG1
## Control: lmerControl(optimizer = "Nelder_Mead")
## 
## REML criterion at convergence: 1713.6
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.1839 -0.5434 -0.1178  0.3648  6.9456 
## 
## Random effects:
##  Groups   Name        Variance  Std.Dev. Corr 
##  county   (Intercept) 185.18702 13.6083       
##           farms         0.02543  0.1595  -0.91
##  Residual             108.64442 10.4233       
## Number of obs: 222, groups:  county, 24
## 
## Fixed effects:
##             Estimate Std. Error       df t value Pr(>|t|)    
## (Intercept) 48.20314    3.66960 18.14852   13.14 1.04e-10 ***
## farms       -0.65252    0.06129 14.48348  -10.65 3.07e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##       (Intr)
## farms -0.866

Have you seen the difference? In the fixed effects, we now get degrees of freedom and p values. The small p values indicate that both the intercept and farms are significant.

Multivariate Models

After getting familiar with MLMs, let’s try to find a multivariate model with all predictors:

m7 <- lmer(pro ~ non.grad.rate.log + withdraw.log + farms + exp + 
             ( 1 |county), data = df_ALG1, na.action = 'na.exclude',
           control = lmerControl(optimizer ="Nelder_Mead"))
summary(m7)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: 
## pro ~ non.grad.rate.log + withdraw.log + farms + exp + (1 | county)
##    Data: df_ALG1
## Control: lmerControl(optimizer = "Nelder_Mead")
## 
## REML criterion at convergence: 1678.6
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.0778 -0.5504 -0.0672  0.4004  6.1848 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  county   (Intercept)  76.59    8.751  
##  Residual             107.31   10.359  
## Number of obs: 219, groups:  county, 24
## 
## Fixed effects:
##                    Estimate Std. Error        df t value Pr(>|t|)    
## (Intercept)        45.96125    6.22282  37.92365   7.386 7.53e-09 ***
## non.grad.rate.log  -3.77603    1.63507 202.35268  -2.309   0.0219 *  
## withdraw.log       -1.78219    1.80596 202.85863  -0.987   0.3249    
## farms              -0.37983    0.07516 212.49868  -5.054 9.33e-07 ***
## exp                 0.10608    0.15669  31.95347   0.677   0.5033    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr) nn.g.. wthdr. farms 
## nn.grd.rt.l -0.040                     
## withdraw.lg -0.083 -0.640              
## farms       -0.102 -0.286 -0.385       
## exp         -0.897 -0.044  0.088  0.015

In the random effects, the residual variance is 107.31, the smallest value among the models above. Here county-level factors explain 76.59 / (76.59 + 107.31) ≈ 41.65% of the variance. Admittedly, the decrease in residual variance is not that large compared with m6. This is because the earlier transformations stretched out the variables’ ranges to create more linear, balanced distributions, and the variance due to the graduation rate and the withdrawal rate overlaps, resulting in only a small further decrease.

Then look at the fixed effects: based on the p-values, the intercept, farms, and non.grad.rate.log are significant, which suggests the remaining variables should be dropped. But here a question arises: do we really want to exclude withdraw.log and exp? A p-value is not a strict criterion we should rely on blindly. What we should do is interpret the results and make the decision based on informed judgement.
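One way to inform that judgement is a formal model comparison: fit a reduced model without the doubtful terms and run a likelihood-ratio test. A sketch (the name m7_reduced is ours; anova() refits both models with ML before comparing):

```r
# Reduced model without withdraw.log and exp
m7_reduced <- lmer(pro ~ non.grad.rate.log + farms + (1 | county),
                   data = df_ALG1, na.action = 'na.exclude',
                   control = lmerControl(optimizer = "Nelder_Mead"))

# Likelihood-ratio test; a large p-value would favor the simpler model
anova(m7, m7_reduced)
```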

Assumption Check

The assumptions for a multilevel model include:

  • the predictors are linearly related to the response variable
  • the residuals are independent
  • the residuals have constant variance
  • the residuals are normally distributed

The linearity assumption can be checked simply by plotting the residuals against each predictor. We will use the function resid() to extract the residuals of the model, then use the function plot() for a graphical check:

par(mfrow = c(2, 2)) 
# na.action = 'na.exclude' keeps resid() aligned with the rows of df_ALG1
resi <- resid(m7)
# plot each predictor on the x-axis against the residuals on the y-axis
plot(df_ALG1$farms, resi)
plot(df_ALG1$non.grad.rate.log, resi)
plot(df_ALG1$withdraw.log, resi)
plot(df_ALG1$exp, resi)

Any non-random pattern would be a sign of an assumption violation; since no clear pattern appears in these plots, the linearity assumption is satisfied. In addition, the earlier feature transformations, square root and log, helped establish this linearity.

In our dataset, each row corresponds to a specific school, and the county random intercept accounts for correlation among schools in the same county, so we can treat the residuals as independent. Next we will use the plot() function to check the constant-variance assumption; passing an lmer()-returned object to plot() produces a residuals-versus-fitted plot.

plot(m7)

The residual plot does not indicate any deviation from a linear form. It also shows relatively constant variance, as there is no evident pattern across the fitted range. The slight reduction in variance on the left may be a result of the skewed distribution of our response variable: as shown in the distribution plot in the EDA part, the mean of pro in both cases is below 30.

Last, the normality assumption can be checked with the help of the functions qqnorm() and qqline():

qqnorm(resid(m7))
qqline(resid(m7))

There is some slight deviation from the expected normal line at the head and tail, but overall the points fall nicely on the line!
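If you want a numerical complement to the Q-Q plot, a Shapiro-Wilk test on the residuals gives a rough check (keep in mind that with 219 observations the test can flag even mild deviations):

```r
# Shapiro-Wilk test of normality on the model residuals;
# na.omit() drops the NA padding introduced by na.action = 'na.exclude'
shapiro.test(na.omit(resid(m7)))
```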

Fit Other Dataset

So far we have used df_ALG1 as an example, and m7 is our preferred model. Next we apply the same specification to the other datasets and compare the results.

df_ALG2
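Refitting the same specification on df_ALG2 only requires swapping the data argument; update() handles this cleanly. A sketch, assuming df_ALG2 has the same columns as df_ALG1 (the name m7_ALG2 is ours):

```r
# Refit m7 on df_ALG2, keeping the formula, na.action, and optimizer settings
m7_ALG2 <- update(m7, data = df_ALG2)
summary(m7_ALG2)
```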

Summary

Note: what kind of county level data can we include?

[1] Kuznetsova A, Brockhoff PB, Christensen RHB. lmerTest package: Tests in linear mixed effects models. Journal of Statistical Software 2017;82(13):1–26.